Search query: deep gp
Supplementary Information for: The Limitations of Large Width in Neural Networks: A Deep Gaussian Process Perspective
In this section we discuss various Deep GP facts presented throughout the main paper. Here we formalize the claim that Deep GP have "infinite capacity." To show that the neural network defined in Eq. (1) is a (degenerate) Deep GP, we must show that each of its layers corresponds to a (degenerate, vector-valued) Gaussian process. Though such layers are GP, their covariance functions correspond only to a finite basis, and therefore they do not have the same properties as nonparametric Deep GP (i.e. the ability to model any function to arbitrary precision). We also note that Deep GP and (Bayesian) neural networks can both be generalized to other hierarchical models, such as Deep Kernel Processes [3].
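As a minimal sketch of why such layers are degenerate (the notation below is ours; Eq. (1) of the paper may use different symbols), consider a layer with Gaussian priors on its weights and biases:

```latex
% Sketch: a finite-width layer as a degenerate GP (illustrative notation).
% Layer: f(x) = W phi(x) + b, with phi : X -> R^H the previous layer's
% features, W_{ij} ~ N(0, sigma_w^2 / H), b_i ~ N(0, sigma_b^2).
% Each output coordinate is then a zero-mean GP with covariance
\[
  k(x, x') \;=\; \operatorname{Cov}\!\big(f_i(x), f_i(x')\big)
  \;=\; \frac{\sigma_w^2}{H}\,\phi(x)^\top \phi(x') \;+\; \sigma_b^2 ,
\]
% a kernel of rank at most H + 1, i.e. a degenerate GP with finitely many
% basis functions. A nonparametric kernel (e.g. squared exponential) has
% infinitely many nonzero eigenvalues, which is the "infinite capacity"
% formalized above.
```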
Deep Q-Exponential Processes
Chang, Zhi, Obite, Chukwudi, Zhou, Shuang, Lan, Shiwei
Motivated by deep neural networks, the deep Gaussian process (DGP) generalizes the standard GP by stacking multiple layers of GPs. Despite the enhanced expressiveness, the GP, as an $L_2$ regularization prior, tends to be over-smooth and sub-optimal for inhomogeneous objects, such as images with edges. Recently, the Q-exponential process (Q-EP) has been proposed as an $L_q$ relaxation of the GP and demonstrated to have more desirable regularization properties, governed by a parameter $q>0$, with $q=2$ corresponding to the GP. Sharing the GP's tractable posterior and predictive distributions, the Q-EP can also be stacked to improve its modeling flexibility. In this paper, we generalize the Q-EP to the deep Q-EP to enjoy both proper regularization and improved expressiveness. The generalization is realized by introducing a shallow Q-EP as a latent variable model and then building a hierarchy of shallow Q-EP layers. A sparse approximation based on inducing points and a scalable variational strategy are applied to facilitate the inference. We demonstrate the numerical advantages of the proposed deep Q-EP model by comparing it with multiple state-of-the-art deep probabilistic models.
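A rough sketch of the $L_q$ regularization idea, paraphrasing the abstract; the exact Q-EP density (including its normalizing constant and radial prefactor) is in the paper, and the scaling below is our simplification:

```latex
% Sketch of the L_q relaxation (our paraphrase; see the paper for the
% exact Q-EP density). For u ~ Q-EP with kernel matrix C, write the
% Mahalanobis radius
\[
  r \;=\; u^\top C^{-1} u .
\]
% Up to normalization and a radial prefactor, the prior penalizes
\[
  -\log p(u) \;\asymp\; \tfrac{1}{2}\, r^{q/2}, \qquad q > 0,
\]
% so q = 2 recovers the familiar GP quadratic penalty (1/2) u^T C^{-1} u,
% while q < 2 penalizes large excursions more mildly -- the sharper,
% edge-preserving regularization the abstract refers to.
```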
Gradient-enhanced deep Gaussian processes for multifidelity modelling
Bone, Viv, van der Heide, Chris, Mackle, Kieran, Jahn, Ingo H. J., Dower, Peter M., Manzie, Chris
Multifidelity models integrate data from multiple sources to produce a single approximator for the underlying process. Dense low-fidelity samples are used to reduce interpolation error, while sparse high-fidelity samples are used to compensate for bias or noise in the low-fidelity samples. Deep Gaussian processes (GPs) are attractive for multifidelity modelling as they are non-parametric, robust to overfitting, perform well for small datasets, and, critically, can capture nonlinear and input-dependent relationships between data of different fidelities. Many datasets naturally contain gradient data, especially when they are generated by computational models that are compatible with automatic differentiation or have adjoint solutions. Principally, this work extends deep GPs to incorporate gradient data. We demonstrate this method on an analytical test problem and a realistic partial differential equation problem, where we predict the aerodynamic coefficients of a hypersonic flight vehicle over a range of flight conditions and geometries. In both examples, the gradient-enhanced deep GP outperforms a gradient-enhanced linear GP model and their non-gradient-enhanced counterparts.
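A minimal sketch of the gradient-enhancement idea in the shallow (single-GP) case, assuming a 1-D RBF kernel; `joint_cov`, the toy target, and all hyperparameters are illustrative, not the paper's deep multifidelity model:

```python
# Sketch: GP regression on values AND gradients with a 1-D RBF kernel.
# Derivatives of a GP are jointly Gaussian with the function itself, so
# gradient observations enter through derivative blocks of the kernel.
import numpy as np

ELL, SIG2 = 0.5, 1.0  # illustrative RBF hyperparameters

def rbf(xa, xb):
    """RBF covariance block and pairwise differences d[i, j] = xa_i - xb_j."""
    d = xa[:, None] - xb[None, :]
    return SIG2 * np.exp(-0.5 * d**2 / ELL**2), d

def joint_cov(xa, xb):
    """Covariance of [f(xa); f'(xa)] against [f(xb); f'(xb)]."""
    K, d = rbf(xa, xb)
    Kfg = K * d / ELL**2                      # cov(f(xa), f'(xb)) = dk/dxb
    Kgf = -K * d / ELL**2                     # cov(f'(xa), f(xb)) = dk/dxa
    Kgg = K * (1.0 / ELL**2 - d**2 / ELL**4)  # cov(f'(xa), f'(xb))
    return np.block([[K, Kfg], [Kgf, Kgg]])

# Sparse values and gradients of a toy target f(x) = sin(3x).
X = np.linspace(0.0, 2.0, 5)
y = np.concatenate([np.sin(3 * X), 3 * np.cos(3 * X)])  # [values; gradients]

Ktt = joint_cov(X, X) + 1e-8 * np.eye(2 * len(X))       # jitter for stability
alpha = np.linalg.solve(Ktt, y)

# Posterior mean of f at test points, conditioned on values and gradients.
Xs = np.linspace(0.0, 2.0, 50)
K, d = rbf(Xs, X)
Ks = np.hstack([K, K * d / ELL**2])  # cross-cov of f(Xs) with [f(X); f'(X)]
mean = Ks @ alpha
print(np.round(mean[::10], 3))
```

The deep GP version replaces this single kernel with a composition of layers, but the mechanism for absorbing gradient data is the same: derivative cross-covariance blocks in the joint prior.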
The Limitations of Large Width in Neural Networks: A Deep Gaussian Process Perspective
Pleiss, Geoff, Cunningham, John P.
Large width limits have been a recent focus of deep learning research: modulo computational practicalities, do wider networks outperform narrower ones? Answering this question has been challenging, as conventional networks gain representational power with width, potentially masking any negative effects. Our analysis in this paper decouples capacity and width via the generalization of neural networks to Deep Gaussian Processes (Deep GP), a class of hierarchical models that subsume neural nets. In doing so, we aim to understand how width affects standard neural networks once they have sufficient capacity for a given modeling task. Our theoretical and empirical results on Deep GP suggest that large width is generally detrimental to hierarchical models. Surprisingly, we prove that even nonparametric Deep GP converge to Gaussian processes, effectively becoming shallower without any increase in representational power. The posterior, which corresponds to a mixture of data-adaptable basis functions, becomes less data-dependent with width. Our tail analysis demonstrates that width and depth have opposite effects: depth accentuates a model's non-Gaussianity, while width makes models increasingly Gaussian. We find there is a "sweet spot" that maximizes test set performance before the limiting GP behavior prevents adaptability, occurring at width = 1 or width = 2 for nonparametric Deep GP. These results make strong predictions about the same phenomenon in conventional neural networks: we show empirically that many neural network architectures need 10-500 hidden units for sufficient capacity, depending on the dataset, but further width degrades test performance.
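A small Monte Carlo sketch of the width effect described above, under our own simplified two-layer setup (an RBF first layer whose H hidden units feed a distance-based second-layer kernel); the construction is illustrative, not the paper's exact model:

```python
# Sketch: why wide Deep GP layers become GP-like. A distance-based second
# layer sees its inputs only through hidden-layer distances; as width H
# grows these concentrate around their expectation, so the conditional
# kernel (and hence the model) stops adapting to the data.
import numpy as np

rng = np.random.default_rng(0)

def rbf(x, ell=1.0):
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * d**2 / ell**2)

x = np.array([0.0, 0.5, 2.0])
K1 = rbf(x) + 1e-10 * np.eye(3)
L = np.linalg.cholesky(K1)

for H in [1, 2, 10, 100, 10000]:
    # Scaled squared distance between the hidden representations of
    # x[0] and x[1], over many prior draws of the width-H hidden layer.
    draws = []
    for _ in range(200):
        F = L @ rng.standard_normal((3, H))  # H i.i.d. GP hidden units
        draws.append(np.sum((F[0] - F[1]) ** 2) / H)
    print(f"H={H:6d}  mean={np.mean(draws):.3f}  std={np.std(draws):.3f}")

# The mean approaches K1[0,0] + K1[1,1] - 2*K1[0,1] while the std -> 0:
# the second layer's kernel becomes deterministic, i.e. the hierarchy
# collapses toward an ordinary (shallower) GP as width grows.
```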
Interpretable deep Gaussian processes
Lu, Chi-Ken, Yang, Scott Cheng-Hsin, Hao, Xiaoran, Shafto, Patrick
We propose interpretable deep Gaussian Processes (GPs) that combine the expressiveness of deep Neural Networks (NNs) with quantified uncertainty of deep GPs. Our approach is based on approximating deep GP as a GP, which allows explicit, analytic forms for compositions of a wide variety of kernels. Consequently, our approach admits interpretation as both NNs with specified activation functions and as a variational approximation to deep GPs. We provide general recipes for deriving the effective kernels for deep GPs of two, three, or infinitely many layers, composed of homogeneous or heterogeneous kernels. Results illustrate the expressiveness of our effective kernels through samples from the prior and inference on simulated data and demonstrate advantages of interpretability by analysis of analytic forms, drawing relations and equivalences across kernels, and a priori identification of non-pathological regimes of hyperparameter space.
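A hedged reconstruction of the effective-kernel recipe for a two-layer composition with a squared-exponential outer kernel (symbols are ours; the paper derives many more cases, including three and infinitely many layers):

```latex
% Sketch of the "effective kernel" moment-matching step (our
% reconstruction). For a two-layer model
%   f ~ GP(0, k_1),   g | f ~ GP(0, k_2),
% with outer kernel k_2(a, b) = sigma_2^2 exp(-(a - b)^2 / (2 ell_2^2)),
% the GP approximation to the composition g(f(x)) uses
\[
  k_{\mathrm{eff}}(x, x') \;=\; \mathbb{E}_{f \sim \mathcal{GP}(0, k_1)}
  \big[ k_2\!\big(f(x), f(x')\big) \big].
\]
% Since f(x) - f(x') ~ N(0, s^2) with
% s^2 = k_1(x, x) + k_1(x', x') - 2 k_1(x, x'), the Gaussian expectation
% has a closed form:
\[
  k_{\mathrm{eff}}(x, x') \;=\;
  \sigma_2^2 \left( 1 + \frac{s^2}{\ell_2^2} \right)^{-1/2},
\]
% an explicit, analytic kernel of the kind the paper uses to interpret
% deep GPs as single GPs with composed kernels.
```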